arXiv:1504.04596v2 [cs.IR] 20 Apr 2015 


Structural Learning of Diverse Ranking 


Yadong Zhu Yanyan Lan Jiafeng Guo Xueqi Cheng 

Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China 

{zhuyadong}@software.ict.ac.cn 
{lanyanyan, guojiafeng, cxq}@ict.ac.cn 


ABSTRACT 

Relevance and diversity are both crucial criteria for an effective
search system. In this paper, we propose a unified learning framework
for simultaneously optimizing both relevance and diversity.
Specifically, the problem is formalized as a structural learning
framework optimizing Diversity-Correlated Evaluation Measures (DCEM),
such as ERR-IA, α-NDCG and NRBP. Within this framework, the
discriminant function is defined to be a bi-criteria objective
maximizing the sum of the relevance scores and dissimilarities (or
diversity) among the documents. Relevance and diversity features are
utilized to define the relevance scores and dissimilarities,
respectively. Compared with traditional methods, the advantages of our
approach are that: (1) directly optimizing DCEM as the loss function
is more fundamental for the task; (2) our framework does not rely on
explicit diversity information such as subtopics, and is thus more
adaptive to real applications; (3) the representation of diversity as
a feature-based scoring function makes it more flexible to incorporate
rich diversity-based features into the learning framework. Extensive
experiments on the public TREC datasets show that our approach
significantly outperforms state-of-the-art diversification approaches,
which validates the above advantages.

Categories and Subject Descriptors 

H.3.3 [Information Search and Retrieval]: Information
Search and Retrieval - Retrieval Models

General Terms 

Algorithms, Experimentation, Performance, Theory 

1. INTRODUCTION

Relevance and diversity are both critical for user experience in real
search scenarios. On one hand, more relevant items should be ranked
higher to satisfy users' information needs. On the other hand,
redundant information should be reduced to satisfy users' diverse
information needs. Recently, many diversity-correlated IR evaluation
measures have been proposed for evaluating search systems from these
two aspects, such as ERR-IA, α-NDCG and NRBP [10], which try to
achieve a balance between relevance and diversity.

To fulfill the requirements of both relevance and diversity, many
diversity-enhancement methods have been developed. They can be mainly
divided into two categories: implicit and explicit methods. Implicit
methods such as MMR run a greedy process that selects documents
according to heuristically defined objectives. Explicit methods
directly diversify search results based on the subtopic information of
user queries, and then greedily select documents according to their
predefined utility functions.

However, these approaches have some disadvantages: (1) the objectives
of implicit methods are mainly heuristic, and their relation to the
DCEM measures is unclear; (2) the diversity in explicit methods is
achieved through the representation of subtopic information, so bias
is easily introduced when estimating the subtopics; (3) these
approaches often utilize a predefined utility function, and thus only
limited features can be incorporated for capturing relevance and
diversity properly.

In order to tackle the above challenges, we propose a unified learning
framework to simultaneously optimize relevance and diversity in this
paper. Firstly, we formalize the problem as a structural learning
framework, in which the objective functions are directly defined as
the diversity-correlated IR evaluation measures, such as ERR-IA,
α-NDCG and NRBP. Secondly, we define the discriminant function as a
bi-criteria objective, which maximizes the sum of the relevance scores
and dissimilarities among the documents. Thirdly, we propose a set of
features to capture relevance and diversity. For relevance, the
traditional relevance features used in the learning-to-rank literature
are adopted; for diversity, a series of diversity features are
utilized, such as dissimilarities of implicit topics, titles, texts,
links and URLs.

To evaluate the effectiveness of the proposed approach, we conduct
extensive experiments on the public TREC datasets. The experimental
results show that our methods significantly outperform the
state-of-the-art diversification approaches in terms of ERR-IA, α-NDCG
and NRBP. Furthermore, our methods also achieve the best results on
traditional intent-aware measures, i.e., Precision-IA and Subtopic
recall. In addition, we discuss the robustness of our methods and the
importance of the proposed diversity features. Finally, we also study
the efficiency of our approach based on an analysis of running time.

The main contributions of this paper lie in: 

• the proposal of a unified learning framework to simultaneously
optimize both relevance and diversity.

• the definition of the discriminant function as a bi-criteria
objective.

• the proposal of rich and useful diversity-based features.

• a thorough experimental evaluation of the proposed approach and
numerous baseline methods.

The rest of the paper is organized as follows. Section 2 describes the
related work on search result diversification. Section 3 formulates
the learning problem of diversity-combined rankings. Section 4
describes the formulation of the discriminant function based on a
bi-criteria objective, and presents a series of useful diversity
features. Section 5 describes the training procedure based on the
structural SVM framework. Section 6 contains the experimental setup
and results. Section 7 presents our concluding remarks.


2. RELATED WORK 

In this section, we review the research work on search
diversification. In general, it can be divided into two categories:
diversity-correlated methods and diversity-correlated evaluation
measures. We introduce each in more detail below.


2.1 Diversity-Correlated Methods 

Diversity-correlated methods can be mainly divided into two
categories: implicit approaches and explicit approaches [27]. The
implicit methods assume that similar documents cover similar aspects
and model inter-document dependencies. For example, the Maximal
Marginal Relevance (MMR) method proposes to iteratively select a
candidate document with the highest similarity to the user query and
the lowest similarity to the already selected documents, in order to
promote novelty. In fact, most of the existing approaches are somehow
inspired by the MMR method. Zhai et al. select documents with high
divergence from one language model to another based on risk
minimization considerations. The explicit methods explicitly model
aspects of a query and then select documents that cover different
aspects. The aspects of a user query can be obtained with a taxonomy
[1, 29, 32], top retrieved documents, query reformulations [20, 28],
or multiple external resources [13]. Overall, the explicit methods
have shown better experimental performance than the implicit methods.

There are also some other methods which attempt to borrow theories
from the economic or political domains. The work in [33, 22] applies
economic portfolio theory to search result ranking, which views search
diversification as a means of risk minimization. The approach in [11]
treats the problem of finding a diverse search result as finding a
proportional representation for the document ranking, similar to a
critical part of most electoral processes.

Recently, some researchers have proposed to utilize machine learning
techniques to solve the diversification problem. Yue et al. [36]
propose to optimize subtopic coverage as the loss function, and
formulate a discriminant function based on maximizing word coverage.
However, their work only focuses on diversity, and discards the
requirements of relevance. They claim that modeling both relevance and
diversity simultaneously is a more challenging problem, which is
exactly what we try to tackle in this paper. Yadong et al. [38]
propose an R-LTR model to solve the diversification problem, which is
derived from a sequential ranking process. Our work in this paper
tries to solve the diversification problem from a discriminative view
within a unified framework. The authors of [2, 23] try to construct a
dynamic ranked-retrieval model, which may be useful in the user
interface design of future retrieval systems. Our paper focuses on the
common static ranking scenario, which is different from their papers.

There are also some on-line learning methods that try to learn
retrieval models by exploiting user click data or implicit user
feedback [21, 30, 24]. These works can tackle the diversity problem to
some extent, but they focus on an 'on-line' or 'interactive' scenario,
which is different from our work. For example, Raman et al. [24]
propose an on-line algorithm that presents a ranking to users at each
step, observes the set of documents the user reads in the presented
ranking, and then updates its model. In our work, by contrast, we
utilize human labels conveying relevance and subtopic information to
learn an optimal 'off-line' retrieval model. The most representative
scenario is the diversity task of the Web Track in TREC. In fact, the
two modes (i.e., 'off-line' and 'on-line') are complementary in
practical applications. People usually utilize historical
human-labeled data to train an optimal retrieval model, and then use
on-line user feedback to update the retrieval model dynamically. We
may investigate on-line algorithms in our future work.

In this paper, we propose to utilize machine learning techniques to
simultaneously optimize both relevance and diversity based on a
bi-criteria objective, which is different from traditional
learning-based approaches and shows promising experimental
performance.


2.2 Diversity-Correlated Evaluation Measures 

Proper evaluation measures are very important to the diversity
problem. They usually act as the objective functions to be optimized
by retrieval systems. However, traditional evaluation measures cannot
capture the diversity property well. Therefore, several diversity
evaluation measures have been proposed.


In the early stage, Zhai et al. [37] define a number of subtopic
recall metrics to measure diversity. Recently, many evaluation
measures based on cascade models have been proposed, such as α-NDCG,
ERR-IA, and NRBP [10]. They measure the diversity of a result list by
explicitly rewarding novelty and penalizing redundancy observed at
every rank. In the meantime, Agrawal et al. also propose a series of
intent-aware versions of the traditional measures, such as MAP-IA and
Precision-IA, in which the traditional measures are applied to each
subtopic independently and then combined together. More recently,
Sakai and Song compare a wide range of diversified IR metrics, and
propose a series of D# measures which have high discriminative power
[26]. Interestingly, a novel proportionality measure called CPR
(Cumulative Proportionality measure) has been proposed [11], which
captures proportionality in search results.

Overall, how to evaluate diversity properly is still an interesting
research problem. The current official evaluation measures of the TREC
diversity task are ERR-IA, α-NDCG and NRBP, which are also the main
objectives to be optimized in our work. As summarized in prior work,
they are all based on cascade models and have the same nature.
Moreover, the ERR-IA measure supports graded relevance values rather
than only binary relevance.


3. THE LEARNING PROBLEM 

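Following the practice of machine learning, our goal is to learn a
hypothesis function $h: \mathcal{X} \to \mathcal{Y}$ between an input
space $\mathcal{X}$ and an output space $\mathcal{Y}$. Here
$\mathcal{X}$ denotes the space of possible candidate sets
$\mathbf{x}$, and $\mathcal{Y}$ denotes the space of predicted
rankings $\mathbf{y}$. In order to quantify the quality of a
prediction $\mathbf{y} = h(\mathbf{x})$, we consider a loss function
$\Delta: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.
$\Delta(\mathbf{y}^{(i)}, \mathbf{y})$ quantifies the penalty of
prediction $\mathbf{y}$ if the correct output is $\mathbf{y}^{(i)}$
for the given $\mathbf{x}^{(i)}$.

We restrict ourselves to the supervised learning scenario. Given a set
of training examples
$S = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) \in \mathcal{X} \times \mathcal{Y} : i = 1, \ldots, n\}$,
the learning strategy is to find a function $h$ which minimizes the
empirical risk defined as:

$$R_S^{\Delta}(h) = \frac{1}{n} \sum_{i=1}^{n} \Delta\big(\mathbf{y}^{(i)}, h(\mathbf{x}^{(i)})\big)$$
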

In the case of learning a diverse ranking, we define the loss based on
the diversity-correlated evaluation measures (DCEM) as follows:

$$\Delta_{DCEM}(\mathbf{y}^{(i)}, \mathbf{y}) = 1 - \frac{DCEM(\mathbf{y})}{DCEM(\mathbf{y}^{(i)})} \tag{1}$$

In this paper, we mainly consider three diversity-correlated
evaluation measures: ERR-IA, α-NDCG and NRBP, which are the current
official evaluation measures of the TREC diversity task. The
corresponding diversity losses are denoted as $\Delta_{ERR\text{-}IA}$,
$\Delta_{\alpha\text{-}NDCG}$ and $\Delta_{NRBP}$, respectively. Where
no confusion arises, DCEM stands for these three measures hereafter.
Table 1 provides a general view of them. Detailed explanations can be
found in the corresponding literature.

Taking α-NDCG as an example, it is formulated as follows:

$$\alpha\text{-NDCG} = \frac{1}{\mathcal{N}} \sum_{k=1}^{K} \sum_{i=1}^{M} \frac{P_i \, g_k^i \, (1-\alpha)^{r_k^i}}{\log_2(k+1)}$$

where $g_k^i$ is a binary relevance value for the document at position
$k$ with respect to subtopic $i$, $\alpha$ is a constant in $(0, 1]$,
$r_k^i = \sum_{j=1}^{k-1} g_j^i$ is the number of documents ranked
before position $k$ that are judged relevant to subtopic $i$, $K$ is
the number of documents in a ranking list, $M$ is the number of
subtopics, $P_i$ is the probability of each subtopic, and
$\mathcal{N}$ is a normalization factor.
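
As an illustration of the formula above, the following is a minimal Python sketch of the unnormalized sum (the normalization factor is omitted). The function name `alpha_dcg`, the dictionary layout of the judgments, and the default of uniform subtopic weights are assumptions made for this example and are not part of the original work.

```python
import math


def alpha_dcg(ranking, judgments, alpha=0.5, weights=None):
    """Unnormalized alpha-DCG of a ranking (illustrative sketch).

    ranking   : list of document ids, best first
    judgments : dict doc_id -> set of subtopic ids the document is relevant to
    weights   : optional dict subtopic -> P_i; a weight of 1.0 per subtopic
                is assumed when omitted
    """
    seen = {}  # r_k^i: relevant documents already ranked for subtopic i
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        topics = judgments.get(doc, set())
        gain = 0.0
        for topic in topics:
            p = 1.0 if weights is None else weights[topic]
            # Each earlier relevant document for this subtopic shrinks the gain.
            gain += p * (1.0 - alpha) ** seen.get(topic, 0)
        for topic in topics:
            seen[topic] = seen.get(topic, 0) + 1
        total += gain / math.log2(k + 1)  # position discount
    return total


# Hypothetical judgments: d1 and d2 cover subtopic "a", d3 covers "b".
judgments = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}}
redundant = alpha_dcg(["d1", "d2", "d3"], judgments)
diverse = alpha_dcg(["d1", "d3", "d2"], judgments)
```

In this toy case the ranking that places the second subtopic earlier receives the larger value, which is the redundancy penalty the measure is designed to express.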

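The above learning framework requires knowledge of $\mathbf{y}^{(i)}$
to use as training data. However, such $\mathbf{y}^{(i)}$ are not
always provided in existing public data sets. Taking the TREC
diversity task as an example, the original labeled data are provided
in the form of
$(q^{(i)}, \mathbf{x}^{(i)}, T^{(i)}, P(x_j^{(i)} \mid t))$ with
$t \in T^{(i)}$ and $x_j^{(i)} \in \mathbf{x}^{(i)}$, where
$\mathbf{x}^{(i)}$ is the candidate document set of the query
$q^{(i)}$, $T^{(i)}$ is the subtopic set of query $q^{(i)}$, $t$ is a
specific subtopic in $T^{(i)}$, and $P(x_j^{(i)} \mid t)$ describes
the relevance of document $x_j^{(i)}$ to subtopic $t$.


Algorithm 1 Training Data Construction via Greedy Selection

Input: $(q^{(i)}, \mathbf{x}^{(i)}, T^{(i)}, P(x_j^{(i)} \mid t))$, $t \in T^{(i)}$, $x_j^{(i)} \in \mathbf{x}^{(i)}$

Output: $\mathbf{y}^{(i)}$

1: Initialize solution $\mathbf{y}^{(i)} \leftarrow \emptyset$

2: for $k = 1, \ldots, K$ do

3:    bestDoc $\leftarrow \arg\max_{d:\, d \in \mathbf{x}^{(i)},\, d \notin \mathbf{y}^{(i)}} DCEM(\mathbf{y}^{(i)} \oplus d)$

4:    $\mathbf{y}^{(i)} \leftarrow \mathbf{y}^{(i)} \cup$ bestDoc

5: end for

6: return $\mathbf{y}^{(i)}$
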

Due to the non-convexity of DCEM measures, it is NP-hard to find the
optimal output $\mathbf{y}^{(i)}$ with the maximal DCEM value.
Therefore, we turn to the greedy selection process described in
Algorithm 1 to construct $\mathbf{y}^{(i)}$, which can be viewed as an
approximately optimal output. The operator $\oplus$ denotes adding a
document to the already selected set. According to the results in
[16, 18], if a submodular function is monotonic (i.e.,
$f(S) \le f(T)$ whenever $S \subseteq T$) and normalized (i.e.,
$f(\emptyset) = 0$), greedily constructing a set of size $K$ gives a
$(1 - 1/e)$-approximation to the optimal. Since any member of DCEM is
a submodular function, we can prove that Algorithm 1 is a
$(1 - 1/e)$-approximation to the optimal (we omit the proof here).
Therefore, the quality of the training data can be guaranteed in
theory.
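
The greedy construction of Algorithm 1 and the loss of Equation (1) can be sketched in Python as follows. This is only a sketch under stated assumptions: unnormalized α-DCG with uniform subtopic weights stands in for a generic DCEM measure, and the names `alpha_dcg`, `greedy_ideal_ranking` and `dcem_loss` are hypothetical helpers introduced for this example.

```python
import math


def alpha_dcg(ranking, judgments, alpha=0.5):
    """Unnormalized alpha-DCG with uniform subtopic weights (an assumption)."""
    seen = {}
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        topics = judgments.get(doc, set())
        gain = sum((1.0 - alpha) ** seen.get(t, 0) for t in topics)
        for t in topics:
            seen[t] = seen.get(t, 0) + 1
        total += gain / math.log2(k + 1)
    return total


def greedy_ideal_ranking(candidates, judgments, k, measure=alpha_dcg):
    """Greedy selection: append the document that maximizes the measure."""
    ranking, remaining = [], list(candidates)
    while remaining and len(ranking) < k:
        best = max(remaining, key=lambda d: measure(ranking + [d], judgments))
        ranking.append(best)
        remaining.remove(best)
    return ranking


def dcem_loss(ideal, predicted, judgments, measure=alpha_dcg):
    """Loss 1 - DCEM(y) / DCEM(y_ideal), as in Equation (1)."""
    best = measure(ideal, judgments)
    if best == 0.0:  # no relevant document at all: treat the loss as zero
        return 0.0
    return 1.0 - measure(predicted, judgments) / best


# Hypothetical judgments for one query.
judgments = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}, "d4": set()}
ideal = greedy_ideal_ranking(["d1", "d2", "d3", "d4"], judgments, k=3)
loss = dcem_loss(ideal, ["d1", "d2", "d3"], judgments)
```

Ties are broken by candidate order here, which is an arbitrary choice; the source does not specify a tie-breaking rule.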

4. DISCRIMINANT FUNCTION 

We focus on hypothesis functions $h(\cdot; \mathbf{w})$ which are
parameterized by a weight vector $\mathbf{w}$, and thus wish to find
$\mathbf{w}$ to minimize the empirical risk,
$R_S^{\Delta}(\mathbf{w}) \equiv R_S^{\Delta}(h(\cdot; \mathbf{w}))$.
Our approach is to learn a discriminant function
$F: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which measures the
quality of the predicted ranking $\mathbf{y}$ for $\mathbf{x}$. Given
$\mathbf{x}$, we can derive a prediction by finding the diverse
ranking $\mathbf{y}$ that maximizes $F$:

$$h(\mathbf{x}; \mathbf{w}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} F(\mathbf{x}, \mathbf{y}; \mathbf{w}). \tag{2}$$

A proper discriminant function should have strong discriminative power
between high-quality and low-quality predictions. Different retrieval
settings may call for different discriminant functions. In this
section, we define a proper discriminant function for diverse ranking.

4.1 Formulation of Bi-criteria Objective 

Here we first analyze our objective, i.e., the DCEM measures
summarized in Table 1, which are the basis of our diversity loss.
These measures have the same nature, and differ only in minor
components such as the way of position discounting. We find that there
are two key points in these measures: the diversity and the gain. The
diversity means intent (or subtopic) coverage, which is based on
explicit subtopic information of a query. Specific to a certain
subtopic, the gain describes redundancy penalizing and position
discounting when accumulating the relevance at every rank. The gain of
a specific document must first be based on its relevance degree. In
general, the final values of DCEM measures can be viewed as a
comprehensive consideration of explicit diversity information and
basic relevance score. Therefore, we define our discriminant function
as a bi-criteria objective to take both relevance and diversity into
consideration.

Table 1: Summary of typical DCEM measures. Columns: diversity, novelty
gain, discount, measure. Rows: α-NDCG (discount simplified to
$D_k = \log(k+1)$), ERR-IA, NRBP.

Figure 1: An Example of Ranking Prediction. All the triangles
represent candidate documents of a query, and the {A, B, C} sets with
different colors represent different subtopics (denoted as
$T^{(i)}$). The solid triangle in each set is relevant to the user
query, and the hollow triangle is irrelevant to the query.

Figure 2: Solution examples for Fig. 1 based on different criteria:
(a) Relevance, (b) Diversity, (c) Bi-criteria.


Figure 1 is a reduced example to illustrate our prediction problem. If
$T^{(i)}$ were known, we could use Algorithm 1 to find a solution with
a high DCEM value. For $K = 3$, the optimal solution in Figure 1 is
$\mathbf{y}^{(i)} = \{A_3, B_1/B_2, C_1/C_2/C_3\}$.

In general, however, $T^{(i)}$ is unknown. Instead we assume that the
candidate set contains a set of discriminative features that can
separate subtopics from each other, and reflect the relevance degree
of each document. For example, we can use topic models to model
implicit topics of documents and consider distances between document
pairs based on implicit topics. If the relevance of each document is
specified by a weight function $w(\cdot)$, the distance (or diversity)
function between document pairs is specified as $d(\cdot, \cdot)$, and
the set selection function is denoted as $f(\cdot)$, then a natural
bi-criteria objective is to maximize the sum of relevance and
dissimilarity of the selected set. It can be simply defined as
follows:

$$f(S) = \lambda_1 \sum_{u \in S} w(u) + \lambda_2 \sum_{u, v \in S} d(u, v) \tag{3}$$

where $S$ is the solution set, and $\lambda_1 > 0$, $\lambda_2 > 0$
are trade-off parameters.

This type of bi-criteria objective has strong discriminative power
between high-quality and low-quality predictions, which have high and
low DCEM values, respectively. For example, suppose all the candidate
documents have binary relevance, i.e., 0 or 1, and the distance
between a document pair is 1 if they belong to different subtopics and
0 otherwise. We set both $\lambda_1$ and $\lambda_2$ to 1 for
simplicity. Then we can get an optimal solution with the maximal value
(i.e., $f(S) = 6$), as shown in Figure 2(c), which is the same as the
optimal solution achieved based on DCEM. If we choose another solution
based on a sole criterion such as relevance or diversity, the solution
has a lower value of $f(S)$, as in Figure 2(a) and Figure 2(b) (i.e.,
$f(S) = 3$ and $f(S) = 4$, respectively), and also a lower DCEM value
at the same time.
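
The values quoted in this example can be reproduced with a short Python sketch of Equation (3). The document names and their relevance labels below are hypothetical stand-ins for the triangles in Figure 1, and the pairwise sum is taken over unordered pairs, which is the reading consistent with the values 6, 3 and 4 above.

```python
from itertools import combinations


def bi_criteria(selected, relevance, subtopic, lam1=1.0, lam2=1.0):
    """f(S) = lam1 * sum of relevance + lam2 * sum of pairwise distances."""
    rel = sum(relevance[d] for d in selected)
    # Distance is 1 for documents from different subtopics, 0 otherwise.
    div = sum(
        1.0 for u, v in combinations(selected, 2) if subtopic[u] != subtopic[v]
    )
    return lam1 * rel + lam2 * div


# Hypothetical candidates: the first letter is the subtopic, 1 means relevant.
relevance = {"A1": 0, "A3": 1, "B1": 1, "B3": 0, "C1": 1, "C2": 1, "C3": 1}
subtopic = {doc: doc[0] for doc in relevance}

by_both = bi_criteria(["A3", "B1", "C1"], relevance, subtopic)       # 6.0
by_relevance = bi_criteria(["C1", "C2", "C3"], relevance, subtopic)  # 3.0
by_diversity = bi_criteria(["A1", "B3", "C1"], relevance, subtopic)  # 4.0
```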

In fact, the bi-criteria objective $f(S)$ shares a similar insight
with prior work, of which we were not aware when developing ours.
Despite some similarities, the details of the two works differ
greatly: the prior work mainly gives a generic theoretical analysis
for a generic setting (e.g., NP-hardness, submodularity and
monotonicity), while our work presents a structural learning framework
for jointly modeling relevance and diversity based on the bi-criteria
objective.

4.2 Definition of Discriminant Function 

We assume $F$ to be linear in a combined feature representation
$\Psi: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{N}$, which can
be denoted as

$$F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^T \Psi(\mathbf{x}, \mathbf{y}). \tag{4}$$

As a proxy for maximizing DCEM values, we then formulate our
discriminant function based on this type of bi-criteria objective
(i.e., Equation (3)) as follows:

$$\mathbf{w}^T \Psi(\mathbf{x}, \mathbf{y}) = \sum_{r \in \mathbf{y}} \mathbf{w}_r^T \psi_r(\mathbf{x}, \mathbf{y}) + \sum_{d \in \mathbf{y} \times \mathbf{y}} \mathbf{w}_d^T \psi_d(\mathbf{x}, \mathbf{y}) \tag{5}$$

where $\psi_r(\mathbf{x}, \mathbf{y})$ denotes the independent
document feature vector describing the relevance of a single document.
The relevance feature vector $\psi_r(\mathbf{x}, \mathbf{y})$ contains
all the standard features traditionally adopted in the
learning-to-rank literature, such as the standard weighting models,
field-based models, term dependence models, link analysis and URL
features. $\psi_d(\mathbf{x}, \mathbf{y})$ denotes the diversity
feature vector describing the dissimilarity between a document pair
$d_i$ and $d_j$ in $\mathbf{y}$, which contains a set of
discriminative features that can separate subtopics from each other in
order to capture diversity effectively. $\mathbf{w}_r$ and
$\mathbf{w}_d$ stand for the corresponding weight vectors for
relevance and diversity.

The first part of Equation (5) is a relevance discriminant function,
and the second part is a diversity discriminant function. The two
kinds of discriminant functions are combined under a bi-criteria
objective in a structural learning model.
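
A minimal Python sketch of evaluating Equation (5) for a given ranking is shown below. The names `dot` and `discriminant`, the dictionary of relevance feature vectors and the callable `div_feats` are assumptions for this illustration, and the pair sum is taken over unordered pairs (summing over ordered pairs would only double the diversity term when the features are symmetric).

```python
from itertools import combinations


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def discriminant(ranking, rel_feats, div_feats, w_r, w_d):
    """Relevance term over documents plus diversity term over document pairs.

    rel_feats : dict doc_id -> relevance feature vector (psi_r)
    div_feats : callable (doc_u, doc_v) -> diversity feature vector (psi_d)
    """
    rel = sum(dot(w_r, rel_feats[d]) for d in ranking)
    div = sum(dot(w_d, div_feats(u, v)) for u, v in combinations(ranking, 2))
    return rel + div


# Hypothetical features: two relevance features and one diversity feature.
rel_feats = {"a": [1.0, 0.0], "b": [0.5, 1.0]}
div_feats = lambda u, v: [1.0 if u != v else 0.0]
score = discriminant(["a", "b"], rel_feats, div_feats, w_r=[2.0, 1.0], w_d=[3.0])
```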


4.3 Diversity Features 

Defining powerful features that capture diversity well is non-trivial
and critical for the success of the learning framework. Our work only
provides a general direction, and presents some representative
features used in our experiments.

Topic Model Diversity. Topic models, such as probabilistic latent
semantic analysis (pLSA), are commonly used to model implicit topics
associated with a set of documents. For the diversity problem, it is
necessary to associate the implicit subtopics with the documents to be
ranked, so that the relations between subtopics can be converted to
relations between documents.

For a training set, we apply pLSA on the candidate sets to get the
implicit subtopic distribution. Then we can define the diversity
feature based on implicit topics as follows:

$$\phi_{d_{topic}}(d_i, d_j) = \sqrt{\sum_{k=1}^{m} \big(p(z_k \mid d_i) - p(z_k \mid d_j)\big)^2}$$

Text Diversity. We can compute the text dissimilarity by TFIDF cosine
based on the vector space model (VSM), defined as follows:

$$\phi_{d_{text}}(d_i, d_j) = 1 - \frac{\mathbf{d}_i \cdot \mathbf{d}_j}{\|\mathbf{d}_i\| \, \|\mathbf{d}_j\|}$$

where $\mathbf{d}_i, \mathbf{d}_j$ are the weighted document vectors
based on TFIDF weights. There also exist other ways of computing it,
such as approaches based on sketching algorithms and Jaccard
similarity.

Title Diversity. The way of computing title diversity is the same as
for text diversity. We list it separately here mainly because the two
belong to different document fields. We denote it as
$\phi_{d_{title}}(d_i, d_j)$.

Anchor Text Diversity. The anchor text can accurately describe the
content of the corresponding page. Therefore, it is also an important
field for a document. The way of computing anchor text diversity is
the same as for text and title. We denote it as
$\phi_{d_{anchor}}(d_i, d_j)$.
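
The topic-based and text-based features above can be sketched in Python as follows. The function names and the sparse dictionary representation of TFIDF vectors are assumptions for this example; returning a dissimilarity of 1 for an empty vector is likewise a convention chosen here, not one stated in the source. The same cosine routine would apply unchanged to title and anchor text vectors.

```python
import math


def topic_diversity(p_i, p_j):
    """Euclidean distance between two implicit-topic distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_i, p_j)))


def cosine_dissimilarity(vec_i, vec_j):
    """1 - cosine similarity of two sparse TFIDF vectors (dict term -> weight)."""
    inner = sum(w * vec_j.get(term, 0.0) for term, w in vec_i.items())
    norm_i = math.sqrt(sum(w * w for w in vec_i.values()))
    norm_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 1.0  # assumed convention for empty documents
    return 1.0 - inner / (norm_i * norm_j)


# Hypothetical inputs: topic distributions over m = 3 topics, TFIDF vectors.
topic_gap = topic_diversity([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
text_gap = cosine_dissimilarity({"jaguar": 1.2, "car": 0.8}, {"jaguar": 1.0, "cat": 0.9})
```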

ODP-Based. The existing ODP taxonomy^1 offers a succinct encoding of
distances between pages (or documents). Usually, the distance between
pages on similar topics in the taxonomy is likely to be small. For two
categories $u$ and $v$, we define the categorical distance between
them as follows:

$$dis(u, v) = 1 - \frac{|l(u, v)|}{\max\{|u|, |v|\}}$$

where $l(u, v)$ is their longest common prefix, $|l(u, v)|$ is its
length, and $|u|$ and $|v|$ are the lengths of categories $u$ and $v$.
For instance, given two categories 'Arts/Movies/Awards/' and
'Arts/Movies/Filmmaking/Directing/Directors/', their distance is 3/5,
since they share the common prefix 'Arts/Movies' of length 2 and the
length of the longest category is 5. Then, given two documents $d_i$
and $d_j$ and their category information sets $C_i$ and $C_j$
respectively, we define the ODP-based diversity feature as:

$$\phi_{d_{ODP}}(d_i, d_j) = \frac{\sum_{u \in C_i} \sum_{v \in C_j} dis(u, v)}{|C_i| \cdot |C_j|}$$

where $|C_i|$ and $|C_j|$ are the numbers of categories in the
corresponding category sets.
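
The categorical distance and the ODP-based feature can be sketched in Python as below, which also checks the 3/5 example from the text. The function names are hypothetical, categories are assumed to be slash-separated strings, and the value returned for empty inputs is an assumption since the source does not cover that case.

```python
def category_distance(u, v):
    """1 - (length of common category prefix) / (length of longer category)."""
    parts_u = [p for p in u.split("/") if p]
    parts_v = [p for p in v.split("/") if p]
    if not parts_u or not parts_v:
        return 1.0  # assumed convention when a category is empty
    common = 0
    for a, b in zip(parts_u, parts_v):
        if a != b:
            break
        common += 1
    return 1.0 - common / max(len(parts_u), len(parts_v))


def odp_diversity(cats_i, cats_j):
    """Average categorical distance over all category pairs of two documents."""
    if not cats_i or not cats_j:
        return 1.0  # assumed convention when a document has no category
    total = sum(category_distance(u, v) for u in cats_i for v in cats_j)
    return total / (len(cats_i) * len(cats_j))


example = category_distance(
    "Arts/Movies/Awards/", "Arts/Movies/Filmmaking/Directing/Directors/"
)
```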

Besides the semantic diversity based on the dissimilarity of document
content, we can also define diversity features from a non-semantic
aspect, such as URL information or the web link structure graph.

Link-Based. By constructing a web link graph, we can calculate the
link similarity of any document pair based on direct inlink or outlink
information. The link-based diversity feature is then defined as
follows:

$$\phi_{d_{link}}(d_i, d_j) = \begin{cases} 0 & url_i \in inlink(d_j) \cup outlink(d_j), \text{ or vice versa} \\ 1 & \text{otherwise} \end{cases}$$

URL-Based. Given the URL information of two documents, we can judge
whether they belong to the same domain or the same site. Then we can
simply define the URL-based diversity feature as follows:

$$\phi_{d_{url}}(d_i, d_j) = \begin{cases} 0 & \text{one URL is the other's prefix} \\ 0.5 & \text{belonging to the same site or domain} \\ 1 & \text{otherwise} \end{cases}$$

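
The two non-semantic features can be sketched in Python as follows. This is an illustrative sketch only: the adjacency dictionary `links` is a hypothetical representation of the link graph, and "same site or domain" is approximated by comparing host names with `urllib.parse.urlsplit`, which is a simplification the source does not prescribe.

```python
from urllib.parse import urlsplit


def link_diversity(url_i, url_j, links):
    """0 if the two pages are directly linked in either direction, else 1.

    links : dict url -> set of urls that link to it or are linked from it
    """
    if url_i in links.get(url_j, set()) or url_j in links.get(url_i, set()):
        return 0.0
    return 1.0


def url_diversity(url_i, url_j):
    """0 for a prefix relation, 0.5 for the same host, 1 otherwise."""
    if url_i.startswith(url_j) or url_j.startswith(url_i):
        return 0.0
    host_i = urlsplit(url_i).hostname
    host_j = urlsplit(url_j).hostname
    if host_i is not None and host_i == host_j:
        return 0.5
    return 1.0


# Hypothetical link graph and URLs.
links = {"http://example.com/a": {"http://example.org/x"}}
linked = link_diversity("http://example.org/x", "http://example.com/a", links)
same_site = url_diversity("http://example.com/a", "http://example.com/b")
```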

Moreover, there also exist other useful resources for defining
diversity features, such as clickthrough logs. User clickthrough
information is very important, and we will take it into consideration
in future work.


5. TRAINING WITH STRUCTURAL SVM 

Structural SVM has been shown to be robust and effective when solving
complex learning problems with non-smooth ranking losses in
information retrieval [31]. In this paper, we use structural SVM to
learn the weight vector $\mathbf{w}$.

Optimization Problem 1 (STRUCTURAL SVM).

$$\min_{\mathbf{w},\, \xi_i \ge 0} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \tag{6}$$

s.t. $\forall i, \ \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}^{(i)}$:

$$\mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) \ge \mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y}) + \Delta_{DCEM}(\mathbf{y}^{(i)}, \mathbf{y}) - \xi_i \tag{7}$$

The objective function (6) to be minimized is a trade-off between
model complexity, $\|\mathbf{w}\|^2$, and a hinge loss relaxation of
the training loss for each training example, $\xi_i$. In SVM training,
the parameter $C$ controls this trade-off and can be tuned to achieve
good performance for different training tasks. The
$\mathbf{y}^{(i)}$ is the optimal solution, which can be chosen via
greedy selection as in Algorithm 1 and minimizes
$\Delta_{DCEM}(\mathbf{y}^{(i)}, \mathbf{y})$, as defined in
Equation (1).

^1 http://www.dmoz.org/

For each $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$ in the training set, a
set of constraints of the form in Equation (7) is added to the
optimization problem, and the number of such constraints is
exponential. Despite the large number of constraints, we can employ
Algorithm 2 to solve OP 1. Algorithm 2 is a cutting plane algorithm,
iteratively adding constraints until we have solved the original
problem within a desired tolerance $\epsilon$ [31]. The algorithm
starts with no constraints, and iteratively finds for each training
example $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$ the output
$\hat{\mathbf{y}}$ associated with the most violated constraint. If
the corresponding constraint is violated by more than $\epsilon$, we
add $\hat{\mathbf{y}}$ into the working set $\mathcal{W}_i$ of active
constraints for sample $i$, and re-solve (6) using the updated
$\mathcal{W}$. It has been shown that Algorithm 2's outer loop is
guaranteed to halt within a polynomial number of iterations for any
desired precision $\epsilon$ [31].

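Within the inner loop of Algorithm 2, we have to compute
$\arg\max_{\mathbf{y} \in \mathcal{Y}} H(\mathbf{y}; \mathbf{w})$,
where

$$H(\mathbf{y}; \mathbf{w}) = \Delta_{DCEM}(\mathbf{y}^{(i)}, \mathbf{y}) + \mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y}) - \mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}),$$

or equivalently,

$$\arg\max_{\mathbf{y} \in \mathcal{Y}} \ \Delta_{DCEM}(\mathbf{y}^{(i)}, \mathbf{y}) + \mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y}) \tag{8}$$
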

In fact, solving Equation (8) exactly is intractable, so an
approximate method in the style of Algorithm 1 is applied instead.
Despite using an approximate constraint generation method, SVM
training is still known to terminate in a polynomial number of
iterations. Moreover, in practice, the training procedure typically
converges much faster than the worst case considered by the
theoretical bounds, and we evaluate this empirically in the following
sections.

Once the weight vector $\mathbf{w}$ is obtained, the prediction of
Equation (2) can be made by employing a greedy selection approach as
in Algorithm 1, using $\mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y})$
in place of the corresponding DCEM measure, and iteratively selecting
the document with the highest marginal gain.
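
The greedy prediction step can be sketched in Python as follows. This is a sketch under assumptions rather than the authors' implementation: the marginal gain of a candidate is taken to be its relevance score plus its weighted dissimilarity to every document already selected, and the names `greedy_predict`, `rel_feats` and `div_feats` are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def greedy_predict(candidates, rel_feats, div_feats, w_r, w_d, k):
    """Build a ranking by repeatedly adding the document with the highest
    marginal gain in the learned bi-criteria score."""
    ranking, remaining = [], list(candidates)
    while remaining and len(ranking) < k:

        def marginal_gain(doc):
            relevance = dot(w_r, rel_feats[doc])
            diversity = sum(dot(w_d, div_feats(doc, s)) for s in ranking)
            return relevance + diversity

        best = max(remaining, key=marginal_gain)
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Hypothetical data: one relevance feature, one diversity feature that is 1
# when two documents come from different (hidden) subtopics.
rel_feats = {"a1": [1.0], "a2": [0.9], "b1": [0.5]}
div_feats = lambda u, v: [1.0 if u[0] != v[0] else 0.0]
diverse = greedy_predict(["a1", "a2", "b1"], rel_feats, div_feats, [1.0], [1.0], k=3)
plain = greedy_predict(["a1", "a2", "b1"], rel_feats, div_feats, [1.0], [0.0], k=3)
```

With a zero diversity weight the procedure reduces to sorting by relevance score, while a positive weight promotes the document from the uncovered subtopic.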


6. EXPERIMENTS 

In this section, we evaluate the effectiveness of our approach
empirically. In particular, we compare against a series of popular
diversification approaches using official TREC diversity measures and
traditional diversity measures. Furthermore, we analyze the
performance robustness of different diversification approaches. In
addition, we study the effect of our approximate constraint generation
in the training procedure, and analyze the importance of our proposed
diversity features. Finally, we study the efficiency of our approach
based on an analysis of running time.


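Algorithm 2 Cutting Plane Algorithm for Solving OP 1 with tolerance $\epsilon$

Input: $(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(n)}, \mathbf{y}^{(n)}), C, \epsilon$

1: $\mathcal{W}_i \leftarrow \emptyset$ for all $i = 1, \ldots, n$

2: repeat

3:    for $i = 1, \ldots, n$ do

4:       $H(\mathbf{y}; \mathbf{w}) \equiv \Delta_{DCEM}(\mathbf{y}^{(i)}, \mathbf{y}) + \mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y}) - \mathbf{w}^T \Psi(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$

5:       compute $\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{Y}} H(\mathbf{y}; \mathbf{w})$

6:       compute $\xi_i = \max\{0, \max_{\mathbf{y} \in \mathcal{W}_i} H(\mathbf{y}; \mathbf{w})\}$

7:       if $H(\hat{\mathbf{y}}; \mathbf{w}) > \xi_i + \epsilon$ then

8:          $\mathcal{W}_i \leftarrow \mathcal{W}_i \cup \{\hat{\mathbf{y}}\}$

9:          $\mathbf{w} \leftarrow$ optimize (6) over $\mathcal{W} = \bigcup_i \mathcal{W}_i$

10:      end if

11:   end for

12: until no $\mathcal{W}_i$ has changed during iteration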


6.1 Experimental Setup 

Here we introduce the experimental setup, including data collections,
evaluation metrics, baseline models and experiment design.

Data Collections. Our experiments are conducted in the context of the
diversity task of the TREC2009 Web Track (WT2009), TREC2010 Web Track
(WT2010), and TREC2011 Web Track (WT2011), which contain 50, 48 and 50
test queries (or topics), respectively. Each topic includes several
subtopics identified by TREC assessors, with binary relevance
judgements provided at the subtopic level^2. Our evaluation is done on
the ClueWeb09 Category B data collection^3, which comprises a total of
50 million English Web documents.

Evaluation Metrics. The current official evaluation metrics of the
diversity task include ERR-IA, α-NDCG and NRBP. They implement a
cascade user model which penalizes redundancy by assuming an
increasing probability that users will stop inspecting the results as
they find their desired information. Additionally, we also use
traditional diversity measures for evaluation, i.e., Precision-IA and
Subtopic recall. They measure the precision across all subtopics of
the query and the ratio of the subtopics covered in the results,
respectively. All the measures are computed at rank cutoff 20.
Moreover, the associated parameters α and β are both set to 0.5, which
is consistent with the default settings in the official TREC
evaluation program.
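
The two traditional diversity measures can be sketched in Python as follows. The function names and the judgment layout are hypothetical, and weighting all subtopics uniformly in Precision-IA is an assumption of this sketch rather than something stated in the text.

```python
def subtopic_recall(ranking, judgments, subtopics, cutoff=20):
    """Fraction of the query's subtopics covered by the top `cutoff` results."""
    wanted = set(subtopics)
    covered = set()
    for doc in ranking[:cutoff]:
        covered |= judgments.get(doc, set()) & wanted
    return len(covered) / len(wanted)


def precision_ia(ranking, judgments, subtopics, cutoff=20):
    """Precision at `cutoff` computed per subtopic, then averaged uniformly."""
    top = ranking[:cutoff]
    per_topic = [
        sum(1 for doc in top if topic in judgments.get(doc, set())) / cutoff
        for topic in subtopics
    ]
    return sum(per_topic) / len(subtopics)


# Hypothetical judgments for a query with three subtopics.
judgments = {"d1": {"a"}, "d2": {"a", "b"}, "d3": set()}
ranking = ["d1", "d2", "d3", "d4"]
s_recall = subtopic_recall(ranking, judgments, ["a", "b", "c"], cutoff=4)
p_ia = precision_ia(ranking, judgments, ["a", "b", "c"], cutoff=4)
```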

Baseline Models. To evaluate the performance of our approach, we
compare it with the state-of-the-art approaches, which are introduced
as follows.

• QL. The standard Query-Likelihood language model is used for
conducting the initial retrieval, which provides the top 1000
retrieved documents as a candidate set for all the diversification
approaches. It is also chosen as a basic baseline method in our
experiment.

• MMR. MMR is a classical implicit diversity method in diversity
research. It employs a linear combination of relevance and diversity
as a metric called "marginal relevance". MMR iteratively selects the
document with the largest "marginal relevance".

• xQuAD. The explicit diversification approaches are popular in the
current research field, among which xQuAD is the most representative
and is used as a baseline model in our experiments [28].

• PM-2. PM-2 is also an explicit method that proposes to optimize
proportionality for search result diversification. It has been shown
to achieve promising performance in the original work, and is also
chosen as a baseline method in our experiment.

• ListMLE. ListMLE is a plain learning-to-rank approach without
diversification considerations, yet it is the state-of-the-art
listwise relevance approach in the LTR field [35]. We use it as a
basic supervised baseline.

• SVMDIV. SVMDIV is a representative supervised approach for search
result diversification [36]. It proposes to optimize subtopic coverage
by maximizing word coverage. It formulates the learning problem and
derives a training method based on structural SVMs. However, SVMDIV
only models diversity and discards the requirement of relevance. For a
fair performance comparison, we first apply ListMLE to do the initial
ranking to capture relevance, and then use SVMDIV to re-rank the top-K
retrieved documents to capture diversity.

^2 In fact, for the WT2011 task, assessors made graded judgements. The
official TREC evaluation program maps these graded judgements to
binary judgements by treating values > 0 as relevant and values ≤ 0 as
not relevant.

^3 http://boston.lti.cs.cmu.edu/Data/clueweb09/

Table 2: Relevance Features for learning on ClueWeb09-B collection [19, 17].

Category   Feature Description   Total
Q-D        TF-IDF                5
Q-D        BM25                  5
Q-D        QL.DIR                5
Q-D        MRF                   10
D          PageRank              1
D          Inlink number         1
D          Outlink number        1



In fact, the above three diversity baselines, MMR, xQuAD
and PM-2, all require a prior relevance function to implement
their diversification steps. In our experiment, we choose
QL as the relevance function for them, and obtain three
unsupervised-relevance versions of the diversification baselines:
MMR_QL, xQuAD_QL and PM-2_QL, respectively. Meanwhile,
we also apply ListMLE as the relevance function to implement
them, and obtain three supervised-relevance versions:
MMR_list, xQuAD_list and PM-2_list, respectively.
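The role of the prior relevance function can be illustrated with a minimal sketch of MMR-style greedy re-ranking. This is not the code used in the experiments: it assumes the common formulation in which a parameter λ linearly trades relevance against the maximum similarity to already selected documents, and the names `mmr_rerank`, `relevance` and `similarity` are hypothetical. The `relevance` scores could come from QL or from ListMLE, which is what distinguishes the two versions of each baseline.

```python
def mmr_rerank(candidates, relevance, similarity, lam=0.5, k=20):
    """Greedily pick up to k documents by marginal relevance.

    relevance: dict mapping document id -> prior relevance score
    similarity: function (doc, doc) -> similarity in [0, 1]
    lam: trade-off weight; lam = 1.0 reduces to pure relevance ranking
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def marginal(doc):
            # Redundancy is the similarity to the closest selected document.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy

        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy usage: "a" and "b" are near duplicates, "c" is less relevant but novel.
rel = {"a": 0.9, "b": 0.85, "c": 0.5}
near_duplicates = {frozenset(("a", "b"))}


def sim(x, y):
    return 1.0 if frozenset((x, y)) in near_duplicates else 0.0


print(mmr_rerank(["a", "b", "c"], rel, sim, lam=0.5, k=2))  # ['a', 'c']
```

Swapping the `relevance` dictionary while keeping the selection loop fixed mirrors how the unsupervised-relevance and supervised-relevance versions differ only in their prior relevance function.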

With different evaluation metrics used as the objectives,
our SVM_DCEM approach has 3 variants as described in
Section 3, denoted as SVM_ERR-IA, SVM_α-NDCG and SVM_NRBP,
respectively.

Experiment Design. In our experiments, we use the Indri
toolkit (version 5.2)^ as the retrieval platform. For the test
query set on each dataset, we use 5-fold cross validation
with a ratio of 3:1:1 for training, validation and testing.
The final test performance is reported as the average over
all the folds.

^http://lemurproject.org/indri
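The 3:1:1 protocol can be sketched as follows. The text does not say which fold serves as the validation fold in each round, so this sketch assumes the fold after the test fold is used; the function name `three_one_one_splits` and the round-robin assignment of queries to folds are likewise illustrative assumptions.

```python
def three_one_one_splits(query_ids, n_folds=5):
    """Yield (train, validation, test) query lists, one triple per round."""
    # Round-robin assignment of queries to folds (an assumption).
    folds = [query_ids[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test_idx = i
        valid_idx = (i + 1) % n_folds  # assumption: the next fold validates
        train = [
            q
            for j, fold in enumerate(folds)
            if j not in (test_idx, valid_idx)
            for q in fold
        ]
        yield train, folds[valid_idx], folds[test_idx]


# With 50 queries, every round uses 30 / 10 / 10 queries.
for train, valid, test in three_one_one_splits(list(range(50))):
    print(len(train), len(valid), len(test))
```

Each query is tested exactly once across the five rounds, so averaging the per-round test scores covers the whole query set.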


For data preprocessing, we apply the Porter stemmer and
stopword removal for indexing and query processing. We
then extract features for each dataset as follows. For relevance,
we use several standard features in learning-to-rank
research, such as typical weighting models (e.g., TF-IDF,
BM25, LM) and a term dependency model (e.g., MRF),
as summarized in Table 2, where Q-D means that the feature
is dependent on both the query and the document, and D means
that the feature only depends on the document. All the
Q-D features are applied in five fields: body, anchor,
title, URL and whole document. Additionally, the MRF feature
has two types of values, ordered phrase and unordered phrase
[17], so its total feature number is 10. For diversity, we
use both the semantic and non-semantic diversity features
described before. For the sake of efficiency, we only consider
the top 100 values of each type of diversity feature for each
document, and the other values are set to zero. Finally,
all feature values are normalized to the range [0,1].
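The two post-processing steps, truncation to the largest values and scaling to [0,1], can be sketched as below. The text does not specify the normalization scheme, so min-max scaling is an assumption here, as are the function names `keep_top_k` and `min_max_normalize`.

```python
def keep_top_k(values, k=100):
    """Keep the k largest entries of one document's feature row, zero the rest."""
    if k >= len(values):
        return list(values)
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def min_max_normalize(values):
    """Scale values linearly to [0, 1] (assumed scheme; constants map to 0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


row = [0.2, 0.9, 0.5, 0.7]
print(keep_top_k(row, k=2))
print(min_max_normalize([2.0, 4.0, 6.0]))
```

Zeroing all but the largest entries keeps each document's diversity feature row sparse, which is the efficiency motivation given above.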

For the three baseline models MMR, xQuAD and PM-2, there
is a single parameter λ to tune. We perform 5-fold
cross validation to train λ by optimizing ERR-IA.
Additionally, for xQuAD and PM-2, the official subtopics
are used as a representation of taxonomy classes to simulate
their best-case scenarios, and uniform probability for
all subtopics is assumed, as described in their work.

For ListMLE, SVMDIV and our approach, we utilize the
same training data generated by Algorithm 1 (where ERR-IA
is chosen as the corresponding DCEM measure for optimizing),
and conduct 5-fold cross validation. ListMLE adopts
the relevance features summarized in Table 2. SVMDIV
adopts the representative word-level features with different
importance criteria, as listed in their paper and released
code [36]. As described in the above subsection, SVMDIV
re-ranks the top-K retrieved documents returned by ListMLE.
We test K ∈ {30, 50, 100}, and find that it performs best at
K = 30. Therefore, the following results of SVMDIV are
achieved with K = 30.

For SVMDIV and our SVM_DCEM, the C parameter of the
SVM is varied over a range of powers of ten. The best C value
is chosen based on the performance on the validation set.
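Choosing C (or, for the baselines, λ) on held-out data amounts to a small grid search. The sketch below is generic: `train_fn` and `validate_fn` are hypothetical callables standing in for model training and for computing a validation score such as ERR-IA, and the grid in the usage example is purely illustrative, not the range used in the experiments.

```python
def select_by_validation(candidates, train_fn, validate_fn):
    """Return the candidate value with the highest validation score.

    train_fn(value) -> model trained with that hyperparameter value
    validate_fn(model) -> validation score, higher is better
    """
    best_value, best_score = None, float("-inf")
    for value in candidates:
        model = train_fn(value)
        score = validate_fn(model)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


# Toy usage with stand-in functions; the grid is illustrative only.
grid = [10.0 ** e for e in range(-2, 3)]
best_c, best_score = select_by_validation(
    grid,
    train_fn=lambda c: c,                      # stand-in "model"
    validate_fn=lambda m: -(m - 1.0) ** 2,     # peaks at c = 1.0
)
print(best_c)
```

Only the validation fold is consulted during selection, so the test fold remains untouched until the final evaluation.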


6.2 Performance Comparison 

We now compare our approaches to the baseline models
on search result diversification. The results of the performance
comparison are shown in Tables 3, 4 and 5, and we have the
following observations.

(1) Regarding the comparison among representative implicit
and explicit approaches, explicit methods (i.e. xQuAD
and PM-2) show better performance than the implicit method
(i.e. MMR) in terms of all the evaluation measures. MMR
is the least effective due to its simple predefined “marginal
relevance”, which tries to capture novelty based only on
inter-document similarity. The two explicit methods achieve
comparable performance: PM-2_list wins on WT2010 and
WT2011, while xQuAD_list wins on WT2009, but their overall
performance differences are small.

Besides, for all these methods, the supervised-relevance
versions (i.e. MMR_list, xQuAD_list and PM-2_list) are all
superior to their corresponding unsupervised-relevance versions
(i.e. MMR_QL, xQuAD_QL and PM-2_QL). The results
indicate the importance of the prior relevance function, which
is required for implementing their diversification steps. By
learning a better relevance function, one can achieve better
performance in diversification. In fact, even the purely
supervised relevance method ListMLE can achieve performance
comparable to the explicit methods in their unsupervised-relevance
versions (i.e. xQuAD_QL and PM-2_QL), which further
supports the importance of a proper relevance function
even in a diversification scenario.

Table 3: Performance comparison of all methods in official TREC diversity measures for WT2009. The
numbers in the parentheses are the relative improvements compared with the baseline method QL. The
highest score among all runs in each column is marked with an asterisk (*).

Method        ERR-IA              α-NDCG              NRBP
QL            0.1637              0.2691              0.1382
MMR_QL        0.1625  (-0.73%)    0.2658  (-1.23%)    0.1361  (-1.52%)
xQuAD_QL      0.1922  (+17.41%)   0.3093  (+14.94%)   0.1674  (+21.13%)
PM-2_QL       0.1835  (+12.10%)   0.2896  (+7.62%)    0.1607  (+16.28%)
ListMLE       0.1913  (+16.86%)   0.3074  (+14.23%)   0.1681  (+21.64%)
MMR_list      0.2022  (+23.52%)   0.3083  (+14.57%)   0.1615  (+16.86%)
xQuAD_list    0.2316  (+41.48%)   0.3437  (+27.72%)   0.1956  (+41.53%)
PM-2_list     0.2294  (+40.13%)   0.3369  (+25.20%)   0.1788  (+29.38%)
SVMDIV        0.2408  (+47.10%)   0.3526  (+31.03%)   0.2073  (+50.00%)
SVM_ERR-IA    0.2613* (+59.62%)   0.3726  (+38.46%)   0.2195  (+58.83%)
SVM_α-NDCG    0.2597  (+58.64%)   0.3765* (+39.91%)   0.2192  (+58.61%)
SVM_NRBP      0.2589  (+58.16%)   0.3712  (+37.94%)   0.2223* (+60.85%)

Table 4: Performance comparison of all methods in official TREC diversity measures for WT2010. The
numbers in the parentheses are the relative improvements compared with the baseline method QL. The
highest score among all runs in each column is marked with an asterisk (*).

Method        ERR-IA              α-NDCG              NRBP
QL            0.1980              0.3024              0.1549
MMR_QL        0.2062  (+4.14%)    0.3150  (+4.17%)    0.1647  (+6.33%)
xQuAD_QL      0.2583  (+30.45%)   0.3882  (+28.37%)   0.2160  (+39.44%)
PM-2_QL       0.2579  (+30.25%)   0.3907  (+29.20%)   0.2166  (+39.83%)
ListMLE       0.2436  (+23.03%)   0.3755  (+24.17%)   0.1949  (+25.82%)
MMR_list      0.2735  (+38.13%)   0.4036  (+33.47%)   0.2252  (+45.38%)
xQuAD_list    0.3278  (+65.56%)   0.4445  (+46.99%)   0.2872  (+85.41%)
PM-2_list     0.3296  (+66.46%)   0.4478  (+48.08%)   0.2901  (+87.28%)
SVMDIV        0.3331  (+68.23%)   0.4593  (+51.88%)   0.2934  (+89.41%)
SVM_ERR-IA    0.3546* (+79.09%)   0.4723  (+56.18%)   0.3097  (+99.94%)
SVM_α-NDCG    0.3521  (+77.83%)   0.4764* (+57.54%)   0.3086  (+99.23%)
SVM_NRBP      0.3514  (+77.47%)   0.4718  (+56.02%)   0.3116* (+101.16%)

Table 5: Performance comparison of all methods in official TREC diversity measures for WT2011. The
numbers in the parentheses are the relative improvements compared with the baseline method QL. The
highest score among all runs in each column is marked with an asterisk (*).

Method        ERR-IA              α-NDCG              NRBP
QL            0.3520              0.4531              0.3123
MMR_QL        0.3534  (+0.40%)    0.4612  (+1.79%)    0.3205  (+2.63%)
xQuAD_QL      0.4231  (+20.20%)   0.5268  (+16.27%)   0.3991  (+27.79%)
PM-2_QL       0.4319  (+22.70%)   0.5334  (+17.72%)   0.4062  (+30.07%)
ListMLE       0.4172  (+18.52%)   0.5169  (+14.08%)   0.3887  (+24.46%)
MMR_list      0.4284  (+21.70%)   0.5302  (+17.02%)   0.3913  (+25.30%)
xQuAD_list    0.4753  (+35.03%)   0.5645  (+24.59%)   0.4274  (+36.86%)
PM-2_list     0.4873  (+38.44%)   0.5786  (+27.70%)   0.4318  (+38.26%)
SVMDIV        0.4898  (+39.15%)   0.5910  (+30.43%)   0.4475  (+43.29%)
SVM_ERR-IA    0.5132* (+45.80%)   0.6137  (+35.44%)   0.4683  (+49.95%)
SVM_α-NDCG    0.5116  (+45.34%)   0.6173* (+36.24%)   0.4679  (+49.82%)
SVM_NRBP      0.5112  (+45.23%)   0.6129  (+35.27%)   0.4691* (+50.21%)
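The parenthesized percentages in Tables 3 to 5 are relative improvements over QL. As a small worked check, the helper below (a hypothetical name, not part of any evaluation tool) recomputes two WT2009 ERR-IA entries from the reported scores.

```python
def relative_improvement(score, baseline):
    """Relative change of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


# WT2009, ERR-IA: QL = 0.1637, xQuAD_QL = 0.1922, MMR_QL = 0.1625
print(round(relative_improvement(0.1922, 0.1637), 2))  # 17.41
print(round(relative_improvement(0.1625, 0.1637), 2))  # -0.73
```

Because the improvement is relative to the QL score on the same dataset, percentages are not directly comparable across datasets with different baseline levels.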

(2) Learning-based methods (i.e. SVMDIV and SVM_DCEM)
further outperform the state-of-the-art explicit methods
in terms of all the evaluation measures. For example,
with the evaluation of ERR-IA, the relative improvement
of SVM_α-NDCG over xQuAD_list is up to 17.16%,
12.27%, 10.31% on WT2009, WT2010, WT2011, respectively,
and the relative improvement of SVM_α-NDCG over
PM-2_list is up to 18.51%, 11.37%, 6.9% on WT2009,
WT2010, WT2011, respectively. Although xQuAD_list and
PM-2_list both utilize the official subtopics as explicit query
aspects to simulate their best-case scenarios, their performance
is still much lower than that of the learning-based approaches,
which indicates that there might be a certain gap between
their predefined utility functions and the final evaluation
measures.


(3) Comparing the learning-based methods, our SVM_DCEM
approaches all outperform the SVMDIV method. The relative
improvement of SVM_α-NDCG over SVMDIV is up
to 11.54%, 9.6%, 6.19% in terms of ERR-IA on WT2009,
WT2010, WT2011, respectively. We further validate that all
these improvements are statistically significant (p-value <
0.01). As we know, SVMDIV simply uses weighted word
coverage as a proxy for explicitly covering subtopics, while
our SVM_DCEM jointly models relevance and diversity based
on a proper bi-criteria objective. Therefore, our SVM_DCEM
approach provides a better formulation of diverse ranking, and
leads to better performance in search result diversification.

(4) Not surprisingly, the method optimizing an evaluation
metric leads to the best performance with respect to the
corresponding evaluation metric. For example, SVM_ERR-IA
performs best with ERR-IA as the evaluation measure,
SVM_α-NDCG performs best with α-NDCG as the evaluation
measure, and SVM_NRBP performs best with NRBP as the
evaluation measure, on all three datasets, and the results are
in accordance with our intuition.

In addition, we also evaluate these diversity methods with
traditional diversity measures, Precision-IA and Subtopic
recall, and the experimental results are shown in Fig. 3 and 4.
We can see that our approaches outperform all the baseline
models on all the datasets, which is consistent with the
evaluation results in Tables 3, 4 and 5. When comparing the 3
variants of the SVM_DCEM approach, SVM_ERR-IA and SVM_α-NDCG
perform a little better than SVM_NRBP, yet their overall
performance differences are small.

6.3 Robustness Analysis

In this section we analyze the robustness of these diversification
methods, i.e. whether the performance improvement
is consistent as compared with the basic relevance baseline
model. Specifically, we define the robustness as the
Win/Loss ratio [36]: the ratio of the number of queries whose
performance improves to the number of queries whose
performance is hurt, as compared with the original
results from QL in terms of ERR-IA.

Table 6: The robustness of the performance of all
diversification methods in Win/Loss ratio.

Method        WT2009   WT2010   WT2011   Total
MMR_QL        18/20    21/15    19/17    58/52
xQuAD_QL      25/16    29/16    28/11    83/38
PM-2_QL       18/19    30/16    30/12
ListMLE       20/18    27/16    26/11    73/45
MMR_list      22/15    29/13    29/10    80/38
xQuAD_list    28/11    31/12    31/12    90/35
PM-2_list     26/15    32/12    32/11    90/38
SVMDIV        30/12    32/11    32/11    94/34
SVM_ERR-IA    33/10    34/11    33/10    100/31
SVM_α-NDCG    33/10    32/9     34/11    99/30
SVM_NRBP      32/11    33/10    34/11    99/32

From the results in Table 6, we first notice that for the
implicit and explicit methods, their supervised-relevance
versions (i.e. MMR_list, xQuAD_list and PM-2_list) show better
robustness than their corresponding unsupervised-relevance
versions, which is also consistent with the evaluation
results in Tables 3, 4 and 5. xQuAD performs better than PM-2
in both the supervised-relevance and the unsupervised-relevance
versions. Among all the diversification baselines,
SVMDIV shows the best performance robustness, with a
total Win/Loss ratio of around 2.8. Finally, our SVM_DCEM
methods achieve the best robustness as compared with all
the baseline methods, with a total Win/Loss ratio of around
3.2. Among the three variants of SVM_DCEM, SVM_α-NDCG
performs a little better than the other two, with a Win/Loss
ratio of 3.3.

Based on the robustness results, we can see that the
performance of our SVM_DCEM methods is more stable than
that of all the baseline methods. This demonstrates that the overall
performance gains of our approach do not come only from some
small subset of queries. In other words, result diversification
for different queries can be well addressed by our
approach.

6.4 Approximate Constraint Generation

In our work, we use an approximate way of constraint
generation for model training, which may compromise our
models' ability to fit the data. Similar to the study in [36],
we address this concern by examining the training loss as
C is varied. A high value of C indicates that the trained model
favors low training loss over low model complexity.

We choose WT2011 as an example, and the training curves
of our three models are shown in Figure 5. With
increasing C, all three models are able to fit the training
data almost perfectly. This indicates that our approximate
constraint generation is acceptable for training purposes.
The results for the other two datasets are similar, and
we do not show them here due to space limitations.
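As a concrete reading of the Win/Loss ratio used in the robustness analysis, the sketch below counts per-query wins and losses against the QL baseline. The function name and the dictionary inputs are hypothetical, and treating queries with identical scores as neither a win nor a loss is an assumption, although it is consistent with the per-dataset counts in Table 6 summing to fewer than the number of queries.

```python
def win_loss(method_scores, baseline_scores):
    """Count queries where the method beats or trails the baseline.

    Both arguments map query id -> per-query ERR-IA.
    Ties count as neither a win nor a loss (assumption).
    """
    wins = sum(1 for q in baseline_scores if method_scores[q] > baseline_scores[q])
    losses = sum(1 for q in baseline_scores if method_scores[q] < baseline_scores[q])
    return wins, losses


# Toy usage with made-up per-query scores.
baseline = {"q1": 0.20, "q2": 0.30, "q3": 0.10, "q4": 0.50}
method = {"q1": 0.40, "q2": 0.30, "q3": 0.05, "q4": 0.60}
wins, losses = win_loss(method, baseline)
print(wins, losses, wins / losses)
```

Applied to the totals in Table 6, SVMDIV's 94 wins and 34 losses give a ratio of about 2.8, matching the figure quoted in the text.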


6.5 Feature Importance Analysis 

In this subsection, we give some analysis of the importance
of the proposed diversity features. Table 7 shows the
ordered list of features used in our learned model (SVM_α-NDCG)
according to the learned weight values (averaged over the three
datasets). From the results, we can see that the two
highest-weighted features, including φ_dtopic, are the most
important, which is in accordance with our intuition that
diversity mainly lies in the rich semantic information.
Meanwhile, the title and anchor text diversity features
(e.g., φ_dtitle) also work well, since
these fields typically provide a precise summary of the content
of the document. Finally, the link- and URL-based
diversity features φ_dlink and φ_durl seem to be the least
important features, which may be due to the sparsity of such
types of features in the data.

As a learning-based method, our model is flexible enough to
incorporate different types of features for capturing both
relevance and diversity. Therefore, it would be interesting to
explore other useful features to further improve the performance
of diverse ranking. We will investigate this issue
in the future.

Figure 3: Performance comparison of all methods in Precision-IA for WT2009, WT2010 and WT2011.

Figure 4: Performance comparison of all methods in Subtopic Recall for WT2009, WT2010 and WT2011.

Figure 5: Training loss comparing C values on
WT2011 dataset.

Table 8: Average training time of different approaches.

Method         ListMLE   SVMDIV   SVM_DCEM
Time (hours)   1.5       2        2.5

6.6 Running Time Analysis 

We further study the efficiency of our approach and the
baseline models. All of the diversification methods (including
the baseline models and our approach) involve
a greedy selection process, which is time-consuming due to
the consideration of the dependency relations between document
pairs. Assuming that the size of the output ranking is K and the
size of the candidate set is n, this type of greedy selection
based on maximizing a certain marginal gain, like Algorithm
1, has a time complexity of O(n * K). With a small K,
the running time is linear in n.
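As a rough check of the O(n * K) bound, one can count marginal-gain evaluations: at step k (k = 1, ..., K) the greedy procedure scores the n - k + 1 candidates that have not yet been selected. The count below assumes that each marginal-gain evaluation takes constant time (for example, because per-candidate statistics are updated incrementally after each selection), which is an assumption of this sketch rather than something stated above.

```latex
\sum_{k=1}^{K} (n - k + 1) \;=\; nK - \frac{K(K-1)}{2} \;\le\; nK \;=\; O(nK)
```

For instance, with a candidate set of n = 1000 documents and an output ranking of K = 20, this gives 20000 - 190 = 19810 evaluations.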

All the learning-based methods (i.e. ListMLE, SVMDIV
and SVM_DCEM) need additional offline training time due
to the supervised learning process. We compare the average
training time of the different learning-based methods, and the
result is shown in Table 8.

Table 7: Ordered list of diversity features with corresponding weight values.

feature   [...]     φ_dtopic   φ_dtitle   [...]     [...]     φ_dlink   φ_durl
weight    2.82987   2.75189    0.95001    0.87450   0.82735   0.06727   0.04800


We can observe that our approach takes a longer but comparable
offline training time among the different learning-based
methods. Besides, in our experiments, we also found that
the three variants of our SVM_DCEM approach have nearly
the same training time. We will attempt to optimize our
code to provide much faster training in future
work.


7. CONCLUSIONS 

In this paper, we propose a unified structural learning
framework for simultaneously optimizing both relevance and
diversity. Firstly, we propose to directly use the
diversity-correlated IR evaluation measures as the objective functions,
such as ERR-IA, α-NDCG and NRBP. Secondly, we define
the discriminant function based on a bi-criteria objective that
takes both relevance and diversity into consideration. Thirdly,
we propose and utilize a series of useful diversity-based
features to facilitate the learning process. Finally, we
demonstrate empirically that our approach can significantly
outperform the state-of-the-art methods on the public TREC
datasets in all kinds of evaluation measures, and show better
performance robustness.

Learning to optimize both relevance and diversity is an
interesting direction. As for future work, we plan to take
diversity into consideration in the goal of the traditional
learning-to-rank framework. For example, we can add a diversity-based
score into the listwise loss functions [35], to obtain a global
ranked list which incorporates both relevance and diversity.